<<<<<<< HEAD Tutorial Guide for Stats II Wk2

GLMs in R

In today’s class we’re going to look at the generalised linear model, which is what we’ll be using for the rest of the module. The GLM allows us to extend the principle of linear regression to different kinds of outcomes using a link function. This is simply an equation that links linear predictors to non-linear outcomes, for instance “yes/no”, counts, etc.

The goal today is more theoretical than usual for our tutorials: we will introduce the glm() function in R, and try to see how it works ‘under the hood’ using what is known as maximum likelihood estimation. MLE is just a way of estimating the parameters of a distribution from the available data; in certain circumstances, the ordinary least squares method we used last term satisfies the objective of MLE.

Learning Outcomes

By the end of today’s class you should:

  1. Know how to use the glm() function in R.
  2. Understand a little about how GLM works, and how it differs from linear regression.
  3. Regret not having learned more calculus.

The glm() function

The glm() function is very similar to the lm() function we’re already familiar with. In fact, you can run a linear regression with the glm() function, because linear regression is part of the same family as the generalised linear model.

stargazer::stargazer(lm(Sepal.Length ~ Sepal.Width + Species, data = iris), type = "html")
Dependent variable:
Sepal.Length
Sepal.Width 0.804***
(0.106)
Speciesversicolor 1.459***
(0.112)
Speciesvirginica 1.947***
(0.100)
Constant 2.251***
(0.370)
Observations 150
R2 0.726
Adjusted R2 0.720
Residual Std. Error 0.438 (df = 146)
F Statistic 128.888*** (df = 3; 146)
Note: p<0.1; p<0.05; p<0.01
stargazer::stargazer(glm(Sepal.Length ~ Sepal.Width + Species, data = iris,
                         family = "gaussian"), type = "html")
Dependent variable:
Sepal.Length
Sepal.Width 0.804***
(0.106)
Speciesversicolor 1.459***
(0.112)
Speciesvirginica 1.947***
(0.100)
Constant 2.251***
(0.370)
Observations 150
Log Likelihood -87.968
Akaike Inf. Crit. 183.937
Note: p<0.1; p<0.05; p<0.01

See how the glm() function, when we specify the family = argument to “gaussian”, gives us almost the same output as the lm() function. This is because linear regression is based on the normal, or Gaussian, distribution. What do you notice is different though?

stargazer::stargazer(glm(am ~ cyl + disp, data = mtcars,
                         family = "binomial"), type = "html")
Dependent variable:
am
cyl 0.743
(0.719)
disp -0.028*
(0.014)
Constant 0.799
(1.992)
Observations 32
Log Likelihood -14.293
Akaike Inf. Crit. 34.586
Note: p<0.1; p<0.05; p<0.01

If we want to use glm() to perform logistic regression, we simply specify the distribution for which logistic regression estimates the parameters, i.e., the binomial distribution.

Logistic (or logit) regression

Let’s quickly review what we use logistic regression for, and how it works. If we have an outcome that can only be true or false, yes or no, then it follows that our estimate for this outcome needs to be bounded between zero (for false/no) and one (for true/yes). It would make no sense to have an estimate less than zero, or more than 1; moreover, if we did get such an estimate, it would suggest that our model was somehow wrong, and that all our other predictions might be off.

Basically, we need a distribution that falls between zero and one, and we need to be able to link the coefficients of our predictors to this distribution. Here’s the distribution:

x <- 1:1000
plot(pbinom(x, size = 1000, prob = 0.2), type = "s", col = "blue",
     main = "Binomial distribution function",
     xlab = "Number of successes", ylab = "F(x)")
lines(pbinom(x, size = 1000, prob = 0.4), type = "s", col = "red")
lines(pbinom(x, size = 1000, prob = 0.6), type = "s", col = "green")
legend("bottomright", 
       legend = c("0.2", "0.4", "0.6"), 
       col = c("blue","red","green"),
       title = "probability")

And here’s the link function:

\[\log\frac{p}{1-p} = \beta X\]

Remember: all the link function does is convert from a probability (0 to 1 scale), which has a curve shape, to log odds (\(-\infty\) to \(\infty\) scale), which is a straight line. We typically use log odds for \(\beta X\) (literally, the log of the odds) because these are symmetrical and normally distributed. Taking the log of the odds also converts the terms in our model from a multiplicative to an additive relationship (i.e., we add the terms rather than multiply them), which is similar to what we are familiar with from linear regression.

It is easy to get confused in logistic regression. There is a difference between odds and odds ratio, log (odds) and log (odds ratio), the logit link function used to convert between probabilities and log odds, and the log likelihood transform used in maximum likelihood estimation to fit the predicted line. It takes time to familiarise yourself with these. In general, we will focus on how to interpret our model coefficients and how to make a prediction based on them.

Using logistic regression

Let’s compare a logistic regression model with a linear model we’re familiar with, using the iris dataset we’re also familiar with.

dat <- iris
dat$set <- ifelse(iris$Species == "setosa", 1, 0)
mod1 <- lm(set ~ Petal.Length + Petal.Width, data = dat)
stargazer::stargazer(mod1, type = "html")
Dependent variable:
set
Petal.Length -0.251***
(0.032)
Petal.Width 0.010
(0.073)
Constant 1.266***
(0.044)
Observations 150
R2 0.852
Adjusted R2 0.849
Residual Std. Error 0.183 (df = 147)
F Statistic 421.497*** (df = 2; 147)
Note: p<0.1; p<0.05; p<0.01


Here, we ran a linear model on a binary outcome (is the iris of species setosa or not?) Let’s make a prediction from our model based upon some imaginary data: an iris with petal length of 5.4cm and width of 2.4cm

newdat <- data.frame(Petal.Length = 5.4,
                     Petal.Width = 2.4)
predict(mod1, newdat)
##          1 
## -0.0675413

As we can see, our model gives us a prediction below zero - does this make sense?

Let’s try again with logistic regression.

mod2 <- glm(set ~ Petal.Length + Petal.Width, data = dat,
            family = "binomial")
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
stargazer::stargazer(mod2, type = "html")
Dependent variable:
set
Petal.Length -17.596
(43,449.070)
Petal.Width -33.887
(115,850.800)
Constant 69.447
(43,042.940)
Observations 150
Log Likelihood -0.000
Akaike Inf. Crit. 6.000
Note: p<0.1; p<0.05; p<0.01


So, we get a different set of coefficients for our predictor variables, both of which are now negative… what does this mean? How do we make a prediction? Let’s try with our same imaginary data.

predict(mod2, newdat)
##         1 
## -106.8992

Well, that doesn’t seem too useful - we’re even further below zero than the linear model. Except, if we check the help file for predict.glm()

Let’s try again with the argument type = "response":

predict(mod2, newdat, type = "response")
##            1 
## 2.220446e-16

What’s the difference? By default we get the log odds prediction; if we want to get the probability from this, we need to reverse the link function, which is -

\[p = \frac{e^{\log(odds)}}{1 + e^{\log(odds)}}\]

Let’s try this out on a slightly more normal range of data, petal length = 2.3 and petal width = 0.8.

newdat2 <- data.frame(Petal.Length = 2.3,
                     Petal.Width = 0.8)
predict(mod2, newdat2, type = "response")
##         1 
## 0.8661015
log_odds <- predict(mod2, newdat2)
exp(log_odds)/(1+exp(log_odds))
##         1 
## 0.8661015
1/(1+exp(-log_odds))
##         1 
## 0.8661015

As you can see, we get the same prediction. Interpreting coefficients and predicted values in logit regression is less straightforward than linear regression. We will look into this in greater detail in the weeks to come, but this link has some helpful examples that explain the difference between log odds, odds, and probability, and how we can calculate each with R.

Before we move on, just to drive home the difference between linear and logistic regression, look at the plot below.

ggplot(data = dat, aes(Petal.Length, set)) +
  geom_point() +
  geom_smooth(method = "lm", color = "blue") +
  geom_smooth(method = "glm", method.args = list(family = "binomial"), color = "red")

Here, we’re plotting the probability of an iris being from the setosa species according to the length of its petals. As we can see, the larger the petal, the lower the probability. However, whereas the linear regression line is not bounded by zero on the y axis, and so the line moves negative above roughly 5cm on the x axis, the logit regression is asymptotic to zero and one on the y axis: no matter how large the petal, the prediction does not go below zero.

Under the hood: optim()

How exactly does R fit this line? We know how linear regression works: we just square the residuals and find the line that minimizes this number. It turned out there was actually a formula that found this solution without all the faff of experimenting:

\[b = \frac{\sum(x-\overline{x})(y-\overline{y})}{\sum(x-\overline{x})^2} \]
\[a = \overline{y}-b\overline{x} \]

Wouldn’t it be great if there was a similar equation for logistic regression? Sadly, there is not. Let’s watch a quick video that explains why.

So, if there’s no equation, how do we work out the best fitting line?

Let’s see how it works in R. We’ll begin by setting a seed for our random data generating functions:

set.seed(2022)
options(scipen = 500)

Next, we’ll create a vector of predictor variables using rnorm():

x <- rnorm(10000) # 1000 random draws from the normal distribution with mean of 0 and 
#sd of 1.

Now, we’ll set the coefficients we want to estimate:

intercept <- 3 # the coefficient of our intercept
slope <- 0.2 # the coefficient of our slope

And finally we want an outcome variable, y, which is binomial (1 or 0). We’re going to make the probability of a 1 or 0 dependent on a combination of the coefficients we just made up.

Because we’re using the binomial distribution, instead of the parameters being mean and standard deviation, we have probability (check ?rbinom() for details.) This is the third argument, which is what the MLE will be trying to maximise (estimate). Note the equation, which is the reverse of our link function, just like we saw above.

y_binom <- rbinom(10000, 1, exp(intercept + slope*x)/(1+exp(intercept + slope*x))) 
# our outcome variable.  

Let’s run the glm:

binom_glm <- glm(y_binom ~ x, family = "binomial")
stargazer::stargazer(binom_glm, type = "html")
Dependent variable:
y_binom
x 0.213***
(0.047)
Constant 3.025***
(0.048)
Observations 10,000
Log Likelihood -1,894.758
Akaike Inf. Crit. 3,793.516
Note: p<0.1; p<0.05; p<0.01


What’s going on under the hood here? We can use the optim() function to find out. optim() minimises (or maximises) a function. Here, we’ll supply the function we used to generate y_binom, but with the two coefficients missing. Compare the results with the glm() function.

f <- function(b) {
  p <- exp(b[1] + b[2]*x)/(1+exp(b[1] + b[2]*x))
  -sum(dbinom(y_binom, 1, p, log=TRUE))
}

ans <- optim(fn = f, par = 0:1, hessian=TRUE)
ans$par
## [1] 3.0254967 0.2126678

Right, I think that’s enough calculus for today. Next week we’ll look at using logistic regression on some real data.

======= Tutorial Guide for Stats II Wk2

GLMs in R

In today’s class we’re going to look at the generalised linear model, which is what we’ll be using for the rest of the module. The GLM allows us to extend the principle of linear regression to different kinds of outcomes using a link function. This is simply an equation that links linear predictors to non-linear outcomes, for instance “yes/no”, counts, etc.

The goal today is more theoretical than usual for our tutorials: we will introduce the glm() function in R, and try to see how it works ‘under the hood’ using what is known as maximum likelihood estimation. MLE is just a way of estimating the parameters of a distribution from the available data; in certain circumstances, the ordinary least squares method we used last term satisfies the objective of MLE.

Learning Outcomes

By the end of today’s class you should:

  1. Know how to use the glm() function in R.
  2. Understand a little about how GLM works, and how it differs from linear regression.
  3. Regret not having learned more calculus.

The glm() function

The glm() function is very similar to the lm() function we’re already familiar with. In fact, you can run a linear regression with the glm() function, because linear regression is part of the same family as the generalised linear model.

stargazer::stargazer(lm(Sepal.Length ~ Sepal.Width + Species, data = iris), type = "html")
Dependent variable:
Sepal.Length
Sepal.Width 0.804***
(0.106)
Speciesversicolor 1.459***
(0.112)
Speciesvirginica 1.947***
(0.100)
Constant 2.251***
(0.370)
Observations 150
R2 0.726
Adjusted R2 0.720
Residual Std. Error 0.438 (df = 146)
F Statistic 128.888*** (df = 3; 146)
Note: p<0.1; p<0.05; p<0.01
stargazer::stargazer(glm(Sepal.Length ~ Sepal.Width + Species, data = iris,
                         family = "gaussian"), type = "html")
Dependent variable:
Sepal.Length
Sepal.Width 0.804***
(0.106)
Speciesversicolor 1.459***
(0.112)
Speciesvirginica 1.947***
(0.100)
Constant 2.251***
(0.370)
Observations 150
Log Likelihood -87.968
Akaike Inf. Crit. 183.937
Note: p<0.1; p<0.05; p<0.01

See how the glm() function, when we specify the family = argument to “gaussian”, gives us almost the same output as the lm() function. This is because linear regression is based on the normal, or Gaussian, distribution. What do you notice is different though?

stargazer::stargazer(glm(am ~ cyl + disp, data = mtcars,
                         family = "binomial"), type = "html")
Dependent variable:
am
cyl 0.743
(0.719)
disp -0.028*
(0.014)
Constant 0.799
(1.992)
Observations 32
Log Likelihood -14.293
Akaike Inf. Crit. 34.586
Note: p<0.1; p<0.05; p<0.01

If we want to use glm() to perform logistic regression, we simply specify the distribution for which logistic regression estimates the parameters, i.e., the binomial distribution.

Logistic (or logit) regression

Let’s quickly review what we use logistic regression for, and how it works. If we have an outcome that can only be true or false, yes or no, then it follows that our estimate for this outcome needs to be bounded between zero (for false/no) and one (for true/yes). It would make no sense to have an estimate less than zero, or more than 1; moreover, if we did get such an estimate, it would suggest that our model was somehow wrong, and that all our other predictions might be off.

Basically, we need a distribution that falls between zero and one, and we need to be able to link the coefficients of our predictors to this distribution. Here’s the distribution:

x <- 1:1000
plot(pbinom(x, size = 1000, prob = 0.2), type = "s", col = "blue",
     main = "Binomial distribution function",
     xlab = "Number of successes", ylab = "F(x)")
lines(pbinom(x, size = 1000, prob = 0.4), type = "s", col = "red")
lines(pbinom(x, size = 1000, prob = 0.6), type = "s", col = "green")
legend("bottomright", 
       legend = c("0.2", "0.4", "0.6"), 
       col = c("blue","red","green"),
       title = "probability")

And here’s the link function:

\[\log\frac{p}{1-p} = \beta X\]

Remember: all the link function does is convert from a probability (0 to 1 scale), which has a curve shape, to log odds (\(-\infty\) to \(\infty\) scale), which is a straight line. We typically use log odds for \(\beta X\) (literally, the log of the odds) because these are symmetrical and normally distributed. Taking the log of the odds also converts the terms in our model from a multiplicative to an additive relationship (i.e., we add the terms rather than multiply them), which is similar to what we are familiar with from linear regression.

It is easy to get confused in logistic regression. There is a difference between odds and odds ratio, log (odds) and log (odds ratio), the logit link function used to convert between probabilities and log odds, and the log likelihood transform used in maximum likelihood estimation to fit the predicted line. It takes time to familiarise yourself with these. In general, we will focus on how to interpret our model coefficients and how to make a prediction based on them.

Using logistic regression

Let’s compare a logistic regression model with a linear model we’re familiar with, using the iris dataset we’re also familiar with.

dat <- iris
dat$set <- ifelse(iris$Species == "setosa", 1, 0)
mod1 <- lm(set ~ Petal.Length + Petal.Width, data = dat)
stargazer::stargazer(mod1, type = "html")
Dependent variable:
set
Petal.Length -0.251***
(0.032)
Petal.Width 0.010
(0.073)
Constant 1.266***
(0.044)
Observations 150
R2 0.852
Adjusted R2 0.849
Residual Std. Error 0.183 (df = 147)
F Statistic 421.497*** (df = 2; 147)
Note: p<0.1; p<0.05; p<0.01


Here, we ran a linear model on a binary outcome (is the iris of species setosa or not?) Let’s make a prediction from our model based upon some imaginary data: an iris with petal length of 5.4cm and width of 2.4cm

newdat <- data.frame(Petal.Length = 5.4,
                     Petal.Width = 2.4)
predict(mod1, newdat)
##          1 
## -0.0675413

As we can see, our model gives us a prediction below zero - does this make sense?

Let’s try again with logistic regression.

mod2 <- glm(set ~ Petal.Length + Petal.Width, data = dat,
            family = "binomial")
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
stargazer::stargazer(mod2, type = "html")
Dependent variable:
set
Petal.Length -17.596
(43,449.070)
Petal.Width -33.887
(115,850.800)
Constant 69.447
(43,042.940)
Observations 150
Log Likelihood -0.000
Akaike Inf. Crit. 6.000
Note: p<0.1; p<0.05; p<0.01


So, we get a different set of coefficients for our predictor variables, both of which are now negative… what does this mean? How do we make a prediction? Let’s try with our same imaginary data.

predict(mod2, newdat)
##         1 
## -106.8992

Well, that doesn’t seem too useful - we’re even further below zero than the linear model. Except, if we check the help file for predict.glm()

Let’s try again with the argument type = "response":

predict(mod2, newdat, type = "response")
##            1 
## 2.220446e-16

What’s the difference? By default we get the log odds prediction; if we want to get the probability from this, we need to reverse the link function, which is -

\[p = \frac{e^{\log(odds)}}{1 + e^{\log(odds)}}\]

Let’s try this out on a slightly more normal range of data, petal length = 2.3 and petal width = 0.8.

newdat2 <- data.frame(Petal.Length = 2.3,
                     Petal.Width = 0.8)
predict(mod2, newdat2, type = "response")
##         1 
## 0.8661015
log_odds <- predict(mod2, newdat2)
exp(log_odds)/(1+exp(log_odds))
##         1 
## 0.8661015
1/(1+exp(-log_odds))
##         1 
## 0.8661015

As you can see, we get the same prediction. Interpreting coefficients and predicted values in logit regression is less straightforward than linear regression. We will look into this in greater detail in the weeks to come, but this link has some helpful examples that explain the difference between log odds, odds, and probability, and how we can calculate each with R.

Before we move on, just to drive home the difference between linear and logistic regression, look at the plot below.

ggplot(data = dat, aes(Petal.Length, set)) +
  geom_point() +
  geom_smooth(method = "lm", color = "blue") +
  geom_smooth(method = "glm", method.args = list(family = "binomial"), color = "red")

Here, we’re plotting the probability of an iris being from the setosa species according to the length of its petals. As we can see, the larger the petal, the lower the probability. However, whereas the linear regression line is not bounded by zero on the y axis, and so the line moves negative above roughly 5cm on the x axis, the logit regression is asymptotic to zero and one on the y axis: no matter how large the petal, the prediction does not go below zero.

Under the hood: optim()

How exactly does R fit this line? We know how linear regression works: we just square the residuals and find the line that minimizes this number. It turned out there was actually a formula that found this solution without all the faff of experimenting:

\[b = \frac{\sum(x-\overline{x})(y-\overline{y})}{\sum(x-\overline{x})^2} \]
\[a = \overline{y}-b\overline{x} \]

Wouldn’t it be great if there was a similar equation for logistic regression? Sadly, there is not. Let’s watch a quick video that explains why.

So, if there’s no equation, how do we work out the best fitting line?

Let’s see how it works in R. We’ll begin by setting a seed for our random data generating functions:

set.seed(2022)
options(scipen = 500)

Next, we’ll create a vector of predictor variables using rnorm():

x <- rnorm(10000) # 1000 random draws from the normal distribution with mean of 0 and 
#sd of 1.

Now, we’ll set the coefficients we want to estimate:

intercept <- 3 # the coefficient of our intercept
slope <- 0.2 # the coefficient of our slope

And finally we want an outcome variable, y, which is binomial (1 or 0). We’re going to make the probability of a 1 or 0 dependent on a combination of the coefficients we just made up.

Because we’re using the binomial distribution, instead of the parameters being mean and standard deviation, we have probability (check ?rbinom() for details.) This is the third argument, which is what the MLE will be trying to maximise (estimate). Note the equation, which is the reverse of our link function, just like we saw above.

y_binom <- rbinom(10000, 1, exp(intercept + slope*x)/(1+exp(intercept + slope*x))) 
# our outcome variable.  

Let’s run the glm:

binom_glm <- glm(y_binom ~ x, family = "binomial")
stargazer::stargazer(binom_glm, type = "html")
Dependent variable:
y_binom
x 0.213***
(0.047)
Constant 3.025***
(0.048)
Observations 10,000
Log Likelihood -1,894.758
Akaike Inf. Crit. 3,793.516
Note: p<0.1; p<0.05; p<0.01


What’s going on under the hood here? We can use the optim() function to find out. optim() minimises (or maximises) a function. Here, we’ll supply the function we used to generate y_binom, but with the two coefficients missing. Compare the results with the glm() function.

f <- function(b) {
  p <- exp(b[1] + b[2]*x)/(1+exp(b[1] + b[2]*x))
  -sum(dbinom(y_binom, 1, p, log=TRUE))
}

ans <- optim(fn = f, par = 0:1, hessian=TRUE)
ans$par
## [1] 3.0254967 0.2126678

Right, I think that’s enough calculus for today. Next week we’ll look at using logistic regression on some real data.

>>>>>>> 0b4ecc09f01cf3ce779016c7676baeb4b5c53720